18 research outputs found

    Design and update of a classification system : the UCSD map of science

    Get PDF
    Global maps of science can be used as a reference system to chart career trajectories, the location of emerging research frontiers, or the expertise profiles of institutes or nations. This paper details data preparation, analysis, and layout performed when designing and subsequently updating the UCSD map of science and classification system. The original classification and map use 7.2 million papers and their references from Elsevier’s Scopus (about 15,000 source titles, 2001–2005) and Thomson Reuters’ Web of Science (WoS) Science, Social Science, Arts & Humanities Citation Indexes (about 9,000 source titles, 2001–2004)–about 16,000 unique source titles. The updated map and classification adds six years (2005–2010) of WoS data and three years (2006–2008) from Scopus to the existing category structure–increasing the number of source titles to about 25,000. To our knowledge, this is the first time that a widely used map of science was updated. A comparison of the original 5-year and the new 10-year maps and classification system show (i) an increase in the total number of journals that can be mapped by 9,409 journals (social sciences had a 80% increase, humanities a 119% increase, medical (32%) and natural science (74%)), (ii) a simplification of the map by assigning all but five highly interdisciplinary journals to exactly one discipline, (iii) a more even distribution of journals over the 554 subdisciplines and 13 disciplines when calculating the coefficient of variation, and (iv) a better reflection of journal clusters when compared with paper-level citation data. When evaluating the map with a listing of desirable features for maps of science, the updated map is shown to have higher mapping accuracy, easier understandability as fewer journals are multiply classified, and higher usability for the generation of data overlays, among others

    Clustering More than Two Million Biomedical Publications: Comparing the Accuracies of Nine Text-Based Similarity Approaches

    Get PDF
    We investigate the accuracy of different similarity approaches for clustering over two million biomedical documents. Clustering large sets of text documents is important for a variety of information needs and applications such as collection management and navigation, summary and analysis. The few comparisons of clustering results from different similarity approaches have focused on small literature sets and have given conflicting results. Our study was designed to seek a robust answer to the question of which similarity approach would generate the most coherent clusters of a biomedical literature set of over two million documents.We used a corpus of 2.15 million recent (2004-2008) records from MEDLINE, and generated nine different document-document similarity matrices from information extracted from their bibliographic records, including titles, abstracts and subject headings. The nine approaches were comprised of five different analytical techniques with two data sources. The five analytical techniques are cosine similarity using term frequency-inverse document frequency vectors (tf-idf cosine), latent semantic analysis (LSA), topic modeling, and two Poisson-based language models--BM25 and PMRA (PubMed Related Articles). The two data sources were a) MeSH subject headings, and b) words from titles and abstracts. Each similarity matrix was filtered to keep the top-n highest similarities per document and then clustered using a combination of graph layout and average-link clustering. Cluster results from the nine similarity approaches were compared using (1) within-cluster textual coherence based on the Jensen-Shannon divergence, and (2) two concentration measures based on grant-to-article linkages indexed in MEDLINE.PubMed's own related article approach (PMRA) generated the most coherent and most concentrated cluster solution of the nine text-based similarity approaches tested, followed closely by the BM25 approach using titles and abstracts. Approaches using only MeSH subject headings were not competitive with those based on titles and abstracts

    Visualizing the Topical Structure of the Medical Sciences: A Self-Organizing Map Approach

    Get PDF
    <div><p>Background</p><p>We implement a high-resolution visualization of the medical knowledge domain using the self-organizing map (SOM) method, based on a corpus of over two million publications. While self-organizing maps have been used for document visualization for some time, (1) little is known about how to deal with truly large document collections in conjunction with a large number of SOM neurons, (2) post-training geometric and semiotic transformations of the SOM tend to be limited, and (3) no user studies have been conducted with domain experts to validate the utility and readability of the resulting visualizations. Our study makes key contributions to all of these issues.</p> <p>Methodology</p><p>Documents extracted from Medline and Scopus are analyzed on the basis of indexer-assigned MeSH terms. Initial dimensionality is reduced to include only the top 10% most frequent terms and the resulting document vectors are then used to train a large SOM consisting of over 75,000 neurons. The resulting two-dimensional model of the high-dimensional input space is then transformed into a large-format map by using geographic information system (GIS) techniques and cartographic design principles. This map is then annotated and evaluated by ten experts stemming from the biomedical and other domains.</p> <p>Conclusions</p><p>Study results demonstrate that it is possible to transform a very large document corpus into a map that is visually engaging and conceptually stimulating to subject experts from both inside and outside of the particular knowledge domain. The challenges of dealing with a truly large corpus come to the fore and require embracing parallelization and use of supercomputing resources to solve otherwise intractable computational tasks. Among the envisaged future efforts are the creation of a highly interactive interface and the elaboration of the notion of this map of medicine acting as a base map, onto which other knowledge artifacts could be overlaid.</p> </div

    The ten terms occurring in the largest number of contiguous patches.

    No full text
    <p>Terms are ordered according to the number of patches. Also given is the number of neurons over which those patches are distributed.</p

    Statistics of label placement for top five term dominance levels.

    No full text
    <p>The first three levels correspond to the blue, red-orange, and green layers in <a href="http://www.plosone.org/article/info:doi/10.1371/journal.pone.0058779#pone-0058779-g003" target="_blank">Figure 3</a>.</p

    Top ten terms for four neurons along the transect in Figure 4.

    No full text
    <p>Table includes four neurons: the first neuron along the transect, its immediate neighbor, the pivot neuron, and the final neuron. For each neuron, terms are ranked according to their relative term dominance.</p

    Zoomed-out view of the complete map of medical literature, plus detailed views of several regions.

    No full text
    <p>Contents and design as presented to domain experts for qualitative evaluation.</p

    Geometric zooming versus semantic zooming.

    No full text
    <p>Juxtaposed are examples of geometric zooming into the static display of multiple levels optimized for preventing label overlaps (top) versus semantic zooming with successive revealing of lower levels of term dominance (bottom).</p

    Processing steps for visualizing a large corpus of medical literature based on the self-organizing map method.

    No full text
    <p>The figure also references processing steps taken for the study by Boyack et al. (2011), which was centered on cluster quality.</p

    Parallelized batch training of the SOM, with 225 parallel processes.

    No full text
    <p>Included is only the first of a total of 240 sequential batches, with the trained SOM serving as input to the subsequent batch.</p
    corecore